Setup

library(limma)
library(edgeR)
library(AnnotationHub)
library(magrittr)
library(scales)
library(tidyverse)
library(pander)
theme_set(theme_bw())

The alignments in this analysis were generated by aligning each library (including technical replicates) to the Zebrafish genome GRCz11 (Ensembl Release 94) using (STAR v2.5.3a). Reads were assigned to each genes based on 100% exonic overlap and unique genomic alignments, based on gene descriptions in Danio_rerio.GRCz11.94.chr.gtf, also obtained from Ensembl Release 94. This set of gene descriptions were then loaded into R as an EnsDb object using the AnnotationHub() infrastructure.

ah <- AnnotationHub() %>%
    subset(species == "Danio rerio") %>%
    subset(rdataclass == "EnsDb")
ensDb <- ah[["AH64906"]]
genes <- genes(ensDb)
counts <- read_tsv("../2_alignedData/featureCounts/genes.out") %>%
    set_names(basename(colnames(.))) %>%
    set_names(str_remove_all(colnames(.), "Aligned.sortedByCoord.out.bam")) 
*Total counts from each library after assigning to genes*

Total counts from each library after assigning to genes

minCpm <- 1
minSamples <- 5
dge <- counts %>%
    gather(key = "Library", value = "Counts", -Geneid) %>%
    mutate(Library = str_replace_all(Library, "__-", "_WT"),
                 Sample = case_when(
                    str_count(Library, "_") == 2 ~ paste0(Library, "_1"),
                    str_count(Library, "_") == 3 ~ Library
                 ),
                 Sample = str_remove_all(Sample, "_[12]$")) %>%
    dplyr::filter(Counts > 0) %>%
    group_by(Geneid, Sample) %>%
    summarise(Counts = sum(Counts)) %>%
    spread(key = Sample, value = Counts, fill = 0) %>%
    as.data.frame() %>%
    column_to_rownames("Geneid") %>%
    magrittr::extract(rowSums(cpm(.) > minCpm) >= minSamples,) %>%
    DGEList(genes = genes[rownames(.)]) %>%
    calcNormFactors()

Extracted RNA from each individual zebrafish was split and run across two NextSeq lanes, giving technical replicates for the sequencing step only. These were combined to provide a single library for each sample, as technical replication at this experimental step is not of particular interest. Genes were retained for analysis if a CPM > 1 was observed for \(\geq\) 5 samples. This equated to about 31 reads for a gene in at least 5 samples for inclusion in downstream analysis, giving a total of 17,962 of the original 32,057 genes for DGE analysis.

dge$samples %<>%
    mutate(Sample = rownames(.),
                 Genotype = str_extract(Sample, "(WT|FAD|FS)"),
                 Genotype = factor(Genotype, levels = c("WT", "FS", "FAD")),
                 group = as.integer(Genotype)) %>%
    set_rownames(.$Sample)
v <- voomWithQualityWeights(dge, design = matrix(rep(1, ncol(dge)), ncol = 1))

Counts were also processed using the voom transformation using quality weights to allow for analysis using normal-based algorithms. Samples weights ranged between 0.3878 and 1.427, with the most down-weigted sample being a WT sample.

Analysis

Data Inspection

The first step for the analysis was to perform an MDS analysis. However, minimal separation was observed between sample groups, with the exception of the most strongly down-weighted sample, which separated very clearly from the remainder of the points. This sample may be a candidate for exclusion if no differentially expressed genes are able to be detected. A simple PCA also revealed that the first few principal components only capture a lesser amount of the total variability than may be expected.

mds <- plotMDS(v, plot = FALSE) %>%
    extract2("cmdscale.out") %>%
    set_colnames(paste0("Dim", 1:2))

MDS plot showing no clear groups within the data. Point sizes indicate sample weights as calculated by voomWithQualityWeights().

First five principal components, showing that the first two only account for 38.0% of the total variance.
  PC1 PC2 PC3 PC4 PC5
Standard deviation 22.19 20.69 15.66 14.61 13.31
Proportion of Variance 0.2031 0.1765 0.1011 0.08807 0.07307
Cumulative Proportion 0.2031 0.3796 0.4807 0.5688 0.6419

Genotype checks

In order to confirm the genotypes of all libraries, three seed sequences were searched in the trimmed reads (i.e. before alignment) which represented the WT, FAD or FS genotypes. The number of matches to each of the seeds was counted for each sample as a simple strategy for confirming genotypes.

Genotypes & Seed sequences use for genotype checking
Genotype Seed
WT GTGCTCAACACTCTGGTC
FAD AACATGATCAGTCTGATC
FS TCGGTGCTCTGGTCATGA
Results from genotype checks based on seed sequences
Sample WT FAD FS Genotype Observed Confirmed
1_FAD_5 4 4 0 FAD FAD TRUE
10_FS_1 6 0 4 FS FS TRUE
11_WT_5 20 0 0 WT WT TRUE
12_WT_4 17 0 0 WT WT TRUE
13_WT_3 23 0 0 WT WT TRUE
14_WT_2 24 0 0 WT WT TRUE
15_WT_1 17 0 0 WT WT TRUE
2_FAD_4 9 11 0 FAD FAD TRUE
3_FAD_3 7 10 0 FAD FAD TRUE
4_FAD_2 12 7 0 FAD FAD TRUE
5_FS_2 13 0 2 FS FS TRUE
6_FAD_1 8 11 0 FAD FAD TRUE
7_FS_5 7 0 1 FS FS TRUE
8_FS_4 0 12 0 FS FAD FALSE
9_FS_3 9 0 1 FS FS TRUE

One sample (8_FS_4) was labelled as an FS sample but presented with the contradictory FAD genotype and this was investgated manually. Notably, no WT or FS reads were found in this sample using this simple strategy.

dge$samples["8_FS_4","Genotype"] <- "FAD"
v$targets["8_FS_4","Genotype"] <- "FAD"

Comparisons

mm <- model.matrix(~0 + Genotype, data = v$targets) %>%
    set_colnames(gsub("Genotype", "", colnames(.)))
cm <- makeContrasts(FSvWT = FS - WT,
                                        FADvWT = FAD - WT,
                                        FADvFS = FAD - FS,
                                        levels = colnames(mm))

Three comparisons were defined with the first two being the difference between the two mutant families and the wild-type samples. The third comparison was defined as being between the two mutant groups.

vFit <- v %>%
    lmFit(mm) %>%
    contrasts.fit(cm) %>%
    eBayes()
colnames(cm) %>%
    sapply(function(x){
        topTable(vFit, x) %>%
            mutate(Comparison = x)
    }, simplify = FALSE)

FS Vs WT